Skip to content

Refresh runner certificate on start when listen IP changes#848

Merged
evanphx merged 2 commits into
mainfrom
mir-1223-miren-app-run-and-sandbox-exec-fail-on-garden
Jun 6, 2026
Merged

Refresh runner certificate on start when listen IP changes#848
evanphx merged 2 commits into
mainfrom
mir-1223-miren-app-run-and-sandbox-exec-fail-on-garden

Conversation

@evanphx

@evanphx evanphx commented Jun 5, 2026

Copy link
Copy Markdown
Contributor

Problem

miren app run and miren sandbox exec fail against distributed runners with:

tls: failed to verify certificate: x509: certificate is valid for
127.0.0.1, ::1, 10.128.0.45, not 10.128.0.47

Root cause: runner VMs are destroyed and recreated with a persistent disk mounted at /var/lib/miren. That disk keeps runner/config.yaml — the runner ID and the certificate issued at first join. On recreate the VM gets a new internal IP, so:

  • The runner ID persists → the runner reattaches to its existing Node and setupEntity rewrites Node.ApiAddress to the new IP (so sandbox exec correctly dials the new IP), but
  • The certificate persists → it's still the one minted for the old IP and is never re-issued → TLS server verification fails.

Fix

Add a RefreshCertificate RPC. On start, the runner inspects its persisted certificate and, only if the discovered listen address isn't covered by the cert's SANs, asks the coordinator to re-issue, writes the new cert back to config, and serves it. This self-heals existing broken runners on their next restart.

A required-but-failed refresh is fatal — serving a stale cert would leave the runner silently unreachable.

Authorization (entirely cert-derived, never from caller input)

  1. A client certificate must be presented (mTLS peer cert).
  2. It must chain to the cluster CA and be currently valid (VerifyCert).
  3. It must be a runner certificate (CN runner-*, org miren).
  4. Exactly one registered Node must match the CN's runner-ID prefix.

Step 4 ties refresh to live membership. caauth has no revocation, so this prevents a removed runner's still-valid certificate from perpetually renewing itself. It fails closed on zero or ambiguous matches (two runners sharing an 8-char ID prefix → rejected, not guessed).

Supporting changes

  • pkg/rpc: expose the verified peer certificate to handlers via CurrentConnectionInfo; add ContextWithConnectionInfo (used by the server and tests).
  • pkg/caauth: add Authority.VerifyCert for an already-parsed certificate.
  • Factor the runner cert CommonName (runnerCertName) and SAN construction (buildRunnerSANs) into shared helpers reused by Join.

Tests

  • reissueRunnerCertificate: happy path (new IP in SANs, CA-signed, CN preserved, via a real Join); rejections for nil peer, foreign-CA cert, non-runner CN, wrong org, unregistered cert, removed runner, and ambiguous identity.
  • RefreshCertificate RPC rejects a call with no client certificate.
  • buildRunnerSANs (IP vs DNS host) and certCoversListenAddr (covered/changed IP, DNS host, bad PEM).

go build ./... and make lint are clean.

Note: also fixed a latent test-harness bug — newTestServer created its CA without ValidFor, so the root was instantly expired. It went unnoticed because no prior test verified a chain; the new tests do.

Follow-ups (not in this PR)

  • Encoding the full runner ID in the certificate at join would remove the residual 8-char-prefix ambiguity edge entirely.
  • Optional: always-refresh on start / periodic renewal to also pick up CA rotation.

Fixes MIR-1223.

Distributed runner VMs are recreated with a persistent disk mounted at
/var/lib/miren, which keeps runner/config.yaml (the runner ID and the
certificate issued at join). On recreate the VM gets a new internal IP:
the runner reattaches to its existing Node and rewrites ApiAddress to the
new IP, but it keeps serving the certificate minted for the old IP. Clients
(sandbox exec, app run) then fail TLS verification:

  tls: failed to verify certificate: x509: certificate is valid for
  ..., 10.128.0.45, not 10.128.0.47

Add a RefreshCertificate RPC so a runner can re-issue its certificate with
SANs covering its current listen address. On start, the runner inspects its
persisted certificate and, only if the discovered listen address is not
covered, calls RefreshCertificate, writes the new cert back to config, and
serves it. A required-but-failed refresh is fatal: serving a stale cert
would leave the runner silently unreachable.

Authorization is derived entirely from the caller's verified certificate,
never from caller-supplied input:
  - the peer cert must chain to the cluster CA and be currently valid
  - it must be a runner cert (CN "runner-*", org "miren")
  - exactly one registered Node must match the CN's runner-ID prefix
The last check ties refresh to live membership (caauth has no revocation),
so a removed runner's still-valid cert cannot perpetually renew itself; it
fails closed on zero or ambiguous matches.

Supporting changes:
  - pkg/rpc: expose the verified peer certificate to handlers via
    CurrentConnectionInfo, plus ContextWithConnectionInfo for tests
  - pkg/caauth: add Authority.VerifyCert for an already-parsed certificate
  - factor the runner cert CommonName and SAN construction into shared
    helpers reused by Join
@evanphx evanphx requested a review from a team as a code owner June 5, 2026 16:52
@coderabbitai

coderabbitai Bot commented Jun 5, 2026

Copy link
Copy Markdown

Review Change Stack

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 3ab7f923-e68b-4d47-9ebd-cee299183100

📥 Commits

Reviewing files that changed from the base of the PR and between 6371d96 and e4b7133.

📒 Files selected for processing (4)
  • api/runner/rpc.yml
  • api/runner/runner_v1alpha/rpc.gen.go
  • servers/runner/registration.go
  • servers/runner/registration_test.go
💤 Files with no reviewable changes (1)
  • servers/runner/registration_test.go
✅ Files skipped from review due to trivial changes (1)
  • api/runner/runner_v1alpha/rpc.gen.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • api/runner/rpc.yml

📝 Walkthrough

Walkthrough

This PR implements runner certificate refresh to support dynamic listen address changes. The feature adds a new RefreshCertificate RPC method to allow runners to request updated server certificates with new SANs when their listen address changes. The implementation includes RPC contract definition and generated bindings, peer certificate propagation through RPC context, server-side authorization via mTLS client certificate verification with registration lookup, and client-side certificate validation during runner startup that automatically triggers refresh when needed.


Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/rpc/server.go`:
- Around line 971-974: handleCalls only sets PeerSubject in the context for
unary RPCs, so ConnectionInfo(ctx).PeerCertificate is nil for v.Call(...) (used
by RefreshCertificate); update the unary path in handleCalls to populate the
same CurrentConnectionInfo fields as the streaming path: set PeerSubject to
r.TLS.PeerCertificates[0].Subject.String() and PeerCertificate to
r.TLS.PeerCertificates[0] (use the same ContextWithConnectionInfo and
CurrentConnectionInfo types), and add defensive checks that r.TLS != nil and
len(r.TLS.PeerCertificates) > 0 before accessing index 0 to avoid panics.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 3ebb42c5-cb0e-465b-b6b6-c340f357ef46

📥 Commits

Reviewing files that changed from the base of the PR and between f4c8ecf and 6371d96.

📒 Files selected for processing (9)
  • api/runner/rpc.yml
  • api/runner/runner_v1alpha/rpc.gen.go
  • cli/commands/runner_start.go
  • cli/commands/runner_start_test.go
  • pkg/caauth/caauth.go
  • pkg/rpc/server.go
  • pkg/rpc/state.go
  • servers/runner/registration.go
  • servers/runner/registration_test.go

Comment thread pkg/rpc/server.go
Comment on lines +971 to 974
ctx = ContextWithConnectionInfo(ctx, &CurrentConnectionInfo{
PeerSubject: r.TLS.PeerCertificates[0].Subject.String(),
PeerCertificate: r.TLS.PeerCertificates[0],
})

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Peer certificate propagation is only fixed for streaming RPCs, not unary RPCs.

RefreshCertificate is invoked via unary v.Call(...), but handleCalls still stores only PeerSubject in context. That leaves ConnectionInfo(ctx).PeerCertificate nil in unary handlers, causing cert-based authorization to fail.

Suggested fix
diff --git a/pkg/rpc/server.go b/pkg/rpc/server.go
@@
 		if r.TLS != nil && len(r.TLS.PeerCertificates) > 0 {
 			call.peer = r.TLS.PeerCertificates[0]
-			ctx = context.WithValue(ctx, connectionKey{}, &CurrentConnectionInfo{
-				PeerSubject: r.TLS.PeerCertificates[0].Subject.String(),
-			})
+			ctx = ContextWithConnectionInfo(ctx, &CurrentConnectionInfo{
+				PeerSubject:     r.TLS.PeerCertificates[0].Subject.String(),
+				PeerCertificate: r.TLS.PeerCertificates[0],
+			})
 		}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/rpc/server.go` around lines 971 - 974, handleCalls only sets PeerSubject
in the context for unary RPCs, so ConnectionInfo(ctx).PeerCertificate is nil for
v.Call(...) (used by RefreshCertificate); update the unary path in handleCalls
to populate the same CurrentConnectionInfo fields as the streaming path: set
PeerSubject to r.TLS.PeerCertificates[0].Subject.String() and PeerCertificate to
r.TLS.PeerCertificates[0] (use the same ContextWithConnectionInfo and
CurrentConnectionInfo types), and add defensive checks that r.TLS != nil and
len(r.TLS.PeerCertificates) > 0 before accessing index 0 to avoid panics.

@phinze phinze left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude and I went through this end to end. The core trick is great: the stale cert is broken for serving but still fine for authenticating, so the runner uses its broken credential to ask for a fixed one. Auth chain is solid, fails closed everywhere, and the comments explain the why at every weird spot.

Worth noting the division of labor with the infra side: pinning addresses over there retires the unrecoverable direction (coordinator drift, where nothing can be negotiated with a coordinator you can't find), while this PR self-heals the recoverable one — and repairs the runners that already drifted, which pinning alone freezes in the broken state.

One thing to keep in mind: MIR-1225 means CN uniqueness isn't enforced at Join, and this auth chain leans entirely on CN. Details inline. Filed MIR-1227 for the renewal follow-up. Nothing blocking. 👍

return nil, fmt.Errorf("runner identity is ambiguous")
}

ips, dnsNames := buildRunnerSANs(listenAddr)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

listenAddr is the one caller-controlled input in the issued cert, so a runner cert is now a forever-renewable ticket for arbitrary SANs (own CN, so audit trail stays honest). Fine for the compromised-runner threat model, but it composes badly with MIR-1225: a stolen CN via reusable invite becomes renewable instead of expiring in a year. Fixing 1225 at Join closes it; just flagging the blast radius until then.

Comment thread servers/runner/registration.go Outdated
default:
s.Log.Warn("RefreshCertificate rejected: runner identity is ambiguous",
"subject", peer.Subject.String(), "matches", len(matches))
return nil, fmt.Errorf("runner identity is ambiguous")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fail-closed is right. Side effect: a prefix collision = runner fatally can't start after an IP change, re-join only escape. Birthday math says never, so this is fine — your full-runner-ID-in-cert follow-up is the real fix, and we've noted on MIR-1225 that it should ride along with that work.

// SAN matching the host of listenAddr: an IP SAN for an IP host, or a DNS SAN
// for a hostname. This mirrors how the coordinator builds the certificate's SANs
// from the listen address.
func certCoversListenAddr(certPEM, listenAddr string) (bool, error) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Expiry cliff: once a cert expires, it can't authenticate its own refresh, and it's manual re-join time. Since we're already parsing the cert here, also refreshing when near NotAfter (while it can still vouch for itself) is a cheap win ahead of periodic renewal. Filed as MIR-1227.

Comment thread servers/runner/registration_test.go Outdated
})

t.Run("ambiguous runner identity is rejected", func(t *testing.T) {
// Two runners whose IDs share the same 8-char prefix produce the same

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fun irony: this fixture is built on exactly the non-uniqueness MIR-1225 removes. Whoever fixes 1225 will need to seed the second node via the entity store instead.

ca, err := caauth.New(caauth.Options{
CommonName: "test-ca",
Organization: "test",
ValidFor: 24 * time.Hour,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch. The footgun that hid this is caauth.New happily minting a born-expired CA on zero ValidFor — a validation error there would've caught it years ago.

Resolve conflicts in api/runner/rpc.yml and the generated rpc.gen.go:
main added WorkloadIssuerInfo (index 6) and IssueWorkloadToken (index 7)
to RunnerRegistration, colliding with RefreshCertificate at index 6.
Renumbered RefreshCertificate to index 8 and regenerated the RPC code.

main also changed runnerCertName to embed the full runner ID (instead of
an 8-char prefix) to prevent CommonName collisions during workload-token
authorization. Dropped this branch's duplicate prefix-based runnerCertName
and reworked the RefreshCertificate membership check to match the Node by
exact runner ID from the certificate CommonName (via findNodeByQuery),
removing the now-unnecessary prefix matching, the ambiguity case, and the
findNodesByRunnerIDPrefix helper.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@servers/runner/registration.go`:
- Around line 767-780: resolveSandboxApp can return empty string on lookup
error/missing data but the code proceeds to IssueTokenWithOptions, allowing
token issuance with an empty app subject; change the flow to fail closed by
checking the returned appName immediately after calling resolveSandboxApp (the
appName variable from resolveSandboxApp(ctx, sandboxID)) and if appName == ""
log an error and return an appropriate error response before calling
s.WorkloadIssuer.IssueTokenWithOptions; apply the same guard around the other
token issuance site that also calls resolveSandboxApp and IssueTokenWithOptions
so both code paths abort when app identity resolution fails.
- Around line 441-452: The RefreshCertificate path currently uses
findNodeByQuery to validate runner registration, but that helper does fuzzy
matching (name/entity/short ID) and can incorrectly accept a cert; change the
logic to perform an exact RunnerId match: replace the findNodeByQuery usage for
this check with a precise lookup or an additional equality check (node.RunnerId
== runnerID) and fail if it doesn't match. Specifically update the
RefreshCertificate flow around the call to findNodeByQuery so that the returned
node is only accepted when node.RunnerId exactly equals the parsed runnerID (and
return the same error/logging path otherwise).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 3ab7f923-e68b-4d47-9ebd-cee299183100

📥 Commits

Reviewing files that changed from the base of the PR and between 6371d96 and e4b7133.

📒 Files selected for processing (4)
  • api/runner/rpc.yml
  • api/runner/runner_v1alpha/rpc.gen.go
  • servers/runner/registration.go
  • servers/runner/registration_test.go
💤 Files with no reviewable changes (1)
  • servers/runner/registration_test.go
✅ Files skipped from review due to trivial changes (1)
  • api/runner/runner_v1alpha/rpc.gen.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • api/runner/rpc.yml

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Inline review comments failed to post. This is likely due to GitHub's internal server error or limits when posting large numbers of comments. If you are seeing this consistently it is likely a permissions issue. Please check "Moderation" -> "Code review limits" under your organization settings.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@servers/runner/registration.go`:
- Around line 767-780: resolveSandboxApp can return empty string on lookup
error/missing data but the code proceeds to IssueTokenWithOptions, allowing
token issuance with an empty app subject; change the flow to fail closed by
checking the returned appName immediately after calling resolveSandboxApp (the
appName variable from resolveSandboxApp(ctx, sandboxID)) and if appName == ""
log an error and return an appropriate error response before calling
s.WorkloadIssuer.IssueTokenWithOptions; apply the same guard around the other
token issuance site that also calls resolveSandboxApp and IssueTokenWithOptions
so both code paths abort when app identity resolution fails.
- Around line 441-452: The RefreshCertificate path currently uses
findNodeByQuery to validate runner registration, but that helper does fuzzy
matching (name/entity/short ID) and can incorrectly accept a cert; change the
logic to perform an exact RunnerId match: replace the findNodeByQuery usage for
this check with a precise lookup or an additional equality check (node.RunnerId
== runnerID) and fail if it doesn't match. Specifically update the
RefreshCertificate flow around the call to findNodeByQuery so that the returned
node is only accepted when node.RunnerId exactly equals the parsed runnerID (and
return the same error/logging path otherwise).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 3ab7f923-e68b-4d47-9ebd-cee299183100

📥 Commits

Reviewing files that changed from the base of the PR and between 6371d96 and e4b7133.

📒 Files selected for processing (4)
  • api/runner/rpc.yml
  • api/runner/runner_v1alpha/rpc.gen.go
  • servers/runner/registration.go
  • servers/runner/registration_test.go
💤 Files with no reviewable changes (1)
  • servers/runner/registration_test.go
✅ Files skipped from review due to trivial changes (1)
  • api/runner/runner_v1alpha/rpc.gen.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • api/runner/rpc.yml
🛑 Comments failed to post (2)
servers/runner/registration.go (2)

441-452: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use an exact RunnerId lookup for refresh authorization.

Line 446 reuses findNodeByQuery, but that helper also matches node name, entity ID, and short ID. A CA-signed runner-<uuid> cert can therefore pass this "still registered" check even after its own node is gone if some other live node happens to match that UUID on one of those fields. This path needs an exact node.RunnerId == runnerID match, not the fuzzy operator lookup.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@servers/runner/registration.go` around lines 441 - 452, The
RefreshCertificate path currently uses findNodeByQuery to validate runner
registration, but that helper does fuzzy matching (name/entity/short ID) and can
incorrectly accept a cert; change the logic to perform an exact RunnerId match:
replace the findNodeByQuery usage for this check with a precise lookup or an
additional equality check (node.RunnerId == runnerID) and fail if it doesn't
match. Specifically update the RefreshCertificate flow around the call to
findNodeByQuery so that the returned node is only accepted when node.RunnerId
exactly equals the parsed runnerID (and return the same error/logging path
otherwise).

767-780: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fail closed when app identity resolution fails.

Line 770 keeps going even though resolveSandboxApp returns "" for any missing data or lookup error. Since the app name becomes part of the workload token subject, issuance should stop here instead of depending on IssueTokenWithOptions to reject an empty app.

Suggested guard
 	appName := s.resolveSandboxApp(ctx, sandboxID)
+	if appName == "" {
+		results.SetError("failed to resolve sandbox app identity")
+		return nil
+	}

Also applies to: 859-889

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@servers/runner/registration.go` around lines 767 - 780, resolveSandboxApp can
return empty string on lookup error/missing data but the code proceeds to
IssueTokenWithOptions, allowing token issuance with an empty app subject; change
the flow to fail closed by checking the returned appName immediately after
calling resolveSandboxApp (the appName variable from resolveSandboxApp(ctx,
sandboxID)) and if appName == "" log an error and return an appropriate error
response before calling s.WorkloadIssuer.IssueTokenWithOptions; apply the same
guard around the other token issuance site that also calls resolveSandboxApp and
IssueTokenWithOptions so both code paths abort when app identity resolution
fails.

@evanphx evanphx merged commit 0589c6a into main Jun 6, 2026
19 checks passed
@evanphx evanphx deleted the mir-1223-miren-app-run-and-sandbox-exec-fail-on-garden branch June 6, 2026 20:50
@miren-code-agent miren-code-agent Bot mentioned this pull request Jun 9, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants